This document explores a dataset containing metadata related to loans such as:
Additional variables provided in the data dictionary, Prosper Loan Data# pip install <package> via windows
#!pip install or !apt-get install for google colab
#!pip install modin
# import all packages and set plots to be embedded inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
#import modin.pandas as pd
import pandas as pd
import matplotlib.ticker as ticker
%matplotlib inline
# from google.colab import drive
# drive.mount('/content/drive')
# #Google colab read csv
# loans_raw = pd.read_csv('/content/drive/My Drive/Colab Notebooks/prosperLoanData.csv')
# Local drive read csv
loans_raw = pd.read_csv('prosperLoanData.csv')
# reload loans_df with the relevant columns, loans_raw contains the original
loans_df = loans_raw.copy()
loans_df.head()
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 193129 | 2007-08-26 19:09:29.263000000 | C | 36 | Completed | 2009-08-14 00:00:00 | 0.16516 | 0.1580 | 0.1380 | ... | -133.18 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 258 |
| 1 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 | NaN | 36 | Current | NaN | 0.12016 | 0.0920 | 0.0820 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 2 | 0EE9337825851032864889A | 81716 | 2007-01-05 15:00:47.090000000 | HR | 36 | Completed | 2009-12-17 00:00:00 | 0.28269 | 0.2750 | 0.2400 | ... | -24.20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 41 |
| 3 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 | NaN | 36 | Current | NaN | 0.12528 | 0.0974 | 0.0874 | ... | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 4 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 | NaN | 36 | Current | NaN | 0.24614 | 0.2085 | 0.1985 | ... | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |
5 rows × 81 columns
# FUNCTION TO PLOT VALS ALONG X-AXIS, NOMINAL VALS ON Y
def yplot_values(gx, form):
initialx = 0
for g in gx.patches:
gx.text(g.get_width(), initialx + g.get_height()/4,
form.format(g.get_width()),
color='black',
ha="left")
initialx += 1
# FUNCTION TO PLOT VALS ALONG Y-AXIS, NOMINAL VALS ON X
def xplot_values(gy, form):
initialy = 0
# Logic to print the proportion text on the bars
for g in gy.patches:
gy.text(initialy + g.get_width()/13, g.get_height(),
form.format(g.get_height()),
color='black',
ha='center', # 'center', 'right', 'left'
va='bottom') # 'top', 'bottom', 'center', 'baseline', 'center_baseline'
initialy += 1
There are 113,937 entries, across 81 variables (as seen above) in this loans dataset. Below are the variables of interest during this exploration.
loans_raw.shape # review shape of dataset
(113937, 81)
loans_raw.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 113937 entries, 0 to 113936 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 113937 non-null object 1 ListingNumber 113937 non-null int64 2 ListingCreationDate 113937 non-null object 3 CreditGrade 28953 non-null object 4 Term 113937 non-null int64 5 LoanStatus 113937 non-null object 6 ClosedDate 55089 non-null object 7 BorrowerAPR 113912 non-null float64 8 BorrowerRate 113937 non-null float64 9 LenderYield 113937 non-null float64 10 EstimatedEffectiveYield 84853 non-null float64 11 EstimatedLoss 84853 non-null float64 12 EstimatedReturn 84853 non-null float64 13 ProsperRating (numeric) 84853 non-null float64 14 ProsperRating (Alpha) 84853 non-null object 15 ProsperScore 84853 non-null float64 16 ListingCategory (numeric) 113937 non-null int64 17 BorrowerState 108422 non-null object 18 Occupation 110349 non-null object 19 EmploymentStatus 111682 non-null object 20 EmploymentStatusDuration 106312 non-null float64 21 IsBorrowerHomeowner 113937 non-null bool 22 CurrentlyInGroup 113937 non-null bool 23 GroupKey 13341 non-null object 24 DateCreditPulled 113937 non-null object 25 CreditScoreRangeLower 113346 non-null float64 26 CreditScoreRangeUpper 113346 non-null float64 27 FirstRecordedCreditLine 113240 non-null object 28 CurrentCreditLines 106333 non-null float64 29 OpenCreditLines 106333 non-null float64 30 TotalCreditLinespast7years 113240 non-null float64 31 OpenRevolvingAccounts 113937 non-null int64 32 OpenRevolvingMonthlyPayment 113937 non-null float64 33 InquiriesLast6Months 113240 non-null float64 34 TotalInquiries 112778 non-null float64 35 CurrentDelinquencies 113240 non-null float64 36 AmountDelinquent 106315 non-null float64 37 DelinquenciesLast7Years 112947 non-null float64 38 PublicRecordsLast10Years 113240 non-null float64 39 PublicRecordsLast12Months 106333 non-null float64 40 RevolvingCreditBalance 106333 non-null float64 41 BankcardUtilization 106333 non-null float64 42 AvailableBankcardCredit 106393 non-null float64 43 TotalTrades 106393 non-null float64 44 TradesNeverDelinquent (percentage) 106393 non-null float64 45 TradesOpenedLast6Months 106393 non-null float64 46 DebtToIncomeRatio 105383 non-null float64 47 IncomeRange 113937 non-null object 48 IncomeVerifiable 113937 non-null bool 49 StatedMonthlyIncome 113937 non-null float64 50 LoanKey 113937 non-null object 51 TotalProsperLoans 22085 non-null float64 52 TotalProsperPaymentsBilled 22085 non-null float64 53 OnTimeProsperPayments 22085 non-null float64 54 ProsperPaymentsLessThanOneMonthLate 22085 non-null float64 55 ProsperPaymentsOneMonthPlusLate 22085 non-null float64 56 ProsperPrincipalBorrowed 22085 non-null float64 57 ProsperPrincipalOutstanding 22085 non-null float64 58 ScorexChangeAtTimeOfListing 18928 non-null float64 59 LoanCurrentDaysDelinquent 113937 non-null int64 60 LoanFirstDefaultedCycleNumber 16952 non-null float64 61 LoanMonthsSinceOrigination 113937 non-null int64 62 LoanNumber 113937 non-null int64 63 LoanOriginalAmount 113937 non-null int64 64 LoanOriginationDate 113937 non-null object 65 LoanOriginationQuarter 113937 non-null object 66 MemberKey 113937 non-null object 67 MonthlyLoanPayment 113937 non-null float64 68 LP_CustomerPayments 113937 non-null float64 69 LP_CustomerPrincipalPayments 113937 non-null float64 70 LP_InterestandFees 113937 non-null float64 71 LP_ServiceFees 113937 non-null float64 72 LP_CollectionFees 113937 non-null float64 73 LP_GrossPrincipalLoss 113937 non-null float64 74 LP_NetPrincipalLoss 113937 non-null float64 75 LP_NonPrincipalRecoverypayments 113937 non-null float64 76 PercentFunded 113937 non-null float64 77 Recommendations 113937 non-null int64 78 InvestmentFromFriendsCount 113937 non-null int64 79 InvestmentFromFriendsAmount 113937 non-null float64 80 Investors 113937 non-null int64 dtypes: bool(3), float64(50), int64(11), object(17) memory usage: 68.1+ MB
# create list
subset = ['EmploymentStatus', 'BorrowerState', 'Occupation',
'CreditGrade', 'ProsperRating (Alpha)', 'ProsperRating (numeric)',
'IncomeRange', 'Term', 'CurrentDelinquencies', 'BorrowerRate', 'DebtToIncomeRatio',
'ListingCreationDate', 'ClosedDate', 'MonthlyLoanPayment']
subset, print('Number of Target Columns: {}\n'.format(len(subset)))
Number of Target Columns: 14
(['EmploymentStatus', 'BorrowerState', 'Occupation', 'CreditGrade', 'ProsperRating (Alpha)', 'ProsperRating (numeric)', 'IncomeRange', 'Term', 'CurrentDelinquencies', 'BorrowerRate', 'DebtToIncomeRatio', 'ListingCreationDate', 'ClosedDate', 'MonthlyLoanPayment'], None)
# Check for duplicate data
loans_raw[subset].duplicated().value_counts()
False 113066 True 871 dtype: int64
Before continuing any further only columns of interest will be retained for the exploration to ensure only the required variables are modified.
# size before modification
loans_df.shape
(113937, 81)
# reload loans_df with the relevant columns, loans_raw contains the original
loans_df = loans_raw.loc[:, subset].copy()
loans_df.head()
| EmploymentStatus | BorrowerState | Occupation | CreditGrade | ProsperRating (Alpha) | ProsperRating (numeric) | IncomeRange | Term | CurrentDelinquencies | BorrowerRate | DebtToIncomeRatio | ListingCreationDate | ClosedDate | MonthlyLoanPayment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Self-employed | CO | Other | C | NaN | NaN | $25,000-49,999 | 36 | 2.0 | 0.1580 | 0.17 | 2007-08-26 19:09:29.263000000 | 2009-08-14 00:00:00 | 330.43 |
| 1 | Employed | CO | Professional | NaN | A | 6.0 | $50,000-74,999 | 36 | 0.0 | 0.0920 | 0.18 | 2014-02-27 08:28:07.900000000 | NaN | 318.93 |
| 2 | Not available | GA | Other | HR | NaN | NaN | Not displayed | 36 | 1.0 | 0.2750 | 0.06 | 2007-01-05 15:00:47.090000000 | 2009-12-17 00:00:00 | 123.32 |
| 3 | Employed | GA | Skilled Labor | NaN | A | 6.0 | $25,000-49,999 | 36 | 4.0 | 0.0974 | 0.15 | 2012-10-22 11:02:35.010000000 | NaN | 321.45 |
| 4 | Employed | MN | Executive | NaN | D | 3.0 | $100,000+ | 36 | 0.0 | 0.2085 | 0.26 | 2013-09-14 18:38:39.097000000 | NaN | 563.97 |
# convert to correct data types
# dates
toDates = ['ListingCreationDate', 'ClosedDate'] # date mask
loans_df[toDates] = loans_df[toDates].astype('datetime64')
# check datatypes are now correct
loans_df[toDates].dtypes
ListingCreationDate datetime64[ns] ClosedDate datetime64[ns] dtype: object
# Extract year and month as ProsperRating and CreditGrade vary with year and month
loans_df['ListingCreationYear'] = pd.DatetimeIndex(
loans_df.ListingCreationDate).year
loans_df['ListingCreationMonth'] = pd.DatetimeIndex(
loans_df.ListingCreationDate).month
loans_df['ClosedDateYear'] = pd.DatetimeIndex(loans_df.ClosedDate).year
loans_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 113937 entries, 0 to 113936 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 EmploymentStatus 111682 non-null object 1 BorrowerState 108422 non-null object 2 Occupation 110349 non-null object 3 CreditGrade 28953 non-null object 4 ProsperRating (Alpha) 84853 non-null object 5 ProsperRating (numeric) 84853 non-null float64 6 IncomeRange 113937 non-null object 7 Term 113937 non-null int64 8 CurrentDelinquencies 113240 non-null float64 9 BorrowerRate 113937 non-null float64 10 DebtToIncomeRatio 105383 non-null float64 11 ListingCreationDate 113937 non-null datetime64[ns] 12 ClosedDate 55089 non-null datetime64[ns] 13 MonthlyLoanPayment 113937 non-null float64 14 ListingCreationYear 113937 non-null int64 15 ListingCreationMonth 113937 non-null int64 16 ClosedDateYear 55089 non-null float64 dtypes: datetime64[ns](2), float64(6), int64(3), object(6) memory usage: 14.8+ MB
There are ratings we were recorded pre 2009 and post 2009.
CreditGrade and ProsperRating, these two are pairs because they cannot be analysed separately due to each rating type being based on a specific time period, where the former was used for users pre 2009 and the latter was for post 2009. Quickly reviewing the two columns reveals there are Nan evident on either field whilst the other contains an entry. As such the dataframe will contain a new column combining the two, specifically the post 2009 overwritting the Nan values.
# combine pre 2009 followed by post 2009
loans_df['CreditRating'] = loans_df['CreditGrade'].fillna(
loans_df['ProsperRating (Alpha)'])
A dictionary will be created to assign and apply the relevant numerical values to the
# create a dictionary to assign key value pairs
ratings_dict = {
'AA': 7,
'A': 6,
'B': 5,
'C': 4,
'D': 3,
'E': 2,
'HR': 1,
'Nan': None,
}
# assign value to new column
loans_df['ProsperRatingNum'] = loans_df['CreditRating'].map(ratings_dict)
loans_df['ProsperRatingNum'].dtypes
dtype('float64')
loans_df['ProsperRatingNum'] = pd.to_numeric(
loans_df['ProsperRatingNum'], downcast='float', errors='coerce')
loans_df['ProsperRatingNum'].dtypes
dtype('float32')
# filter employmentstatus and income range
loans_df = loans_df[(loans_df.EmploymentStatus != 'Not available') & (
loans_df.IncomeRange != 'Not displayed')]
Having Not Available and Income Displayed is redundant information that doesn't assist in the analysis of what affects BorrowRate.
With the Employment Status analysed, the next breakdown from Employment are the Occupations.
Income range requires cleaning as evident with \$0 and Not employed values.
loans_df.IncomeRange.value_counts(dropna=False)
$25,000-49,999 32192 $50,000-74,999 31050 $100,000+ 17337 $75,000-99,999 16916 $1-24,999 7274 Not employed 806 $0 621 Name: IncomeRange, dtype: int64
loans_df.IncomeRange = loans_df.IncomeRange.replace(
['Not employed'], '$0') # replace incorrect data
loans_df.IncomeRange = loans_df.IncomeRange.replace(['Not displayed'], None)
# Find ordinal variables and set order from lowest (left) to highest (right)
ordinal_dict = {'CreditGrade': ['HR', 'E', 'D', 'C', 'B', 'A', 'AA'],
'CreditRating': ['HR', 'E', 'D', 'C', 'B', 'A', 'AA'],
'IncomeRange': ['$0', '$1-24,999', '$25,000-49,999', '$50,000-74,999', '$75,000-99,999', '$100,000+']
}
# Udacity code snippet to replace Column in existing DF with newly configured order (from above)
for var in ordinal_dict:
ordered_var = pd.api.types.CategoricalDtype(ordered=True,
categories=ordinal_dict[var])
loans_df[var] = loans_df[var].astype(ordered_var)
loans_df['CreditGrade'].dtypes
CategoricalDtype(categories=['HR', 'E', 'D', 'C', 'B', 'A', 'AA'], ordered=True)
loans_df['IncomeRange'].dtypes
CategoricalDtype(categories=['$0', '$1-24,999', '$25,000-49,999', '$50,000-74,999',
'$75,000-99,999', '$100,000+'],
, ordered=True)
# separation of variables
numbers = ['ProsperRatingNum', 'Term', 'CurrentDelinquencies',
'BorrowerRate', 'DebtToIncomeRatio', 'MonthlyLoanPayment']
categories = ['EmploymentStatus', 'BorrowerState',
'Occupation', 'CreditRating', 'CreditGrade', 'IncomeRange']
First visual is to grasp the amount of NaN present in the current dataset to determine whether additional cleaning is required.
# create empty list to hold series data, information is held in .index (column names) and .values (number of NaN per column)
Nan_sublist_all = []
Nan_sublist_all = loans_df.isna().sum()
Nan_sublist_all.sort_values(ascending=False)
CreditGrade 84984 ClosedDateYear 58848 ClosedDate 58848 ProsperRating (Alpha) 21343 ProsperRating (numeric) 21343 DebtToIncomeRatio 8430 BorrowerState 1702 Occupation 1333 ProsperRatingNum 131 CreditRating 131 CurrentDelinquencies 20 ListingCreationYear 0 ListingCreationMonth 0 EmploymentStatus 0 MonthlyLoanPayment 0 ListingCreationDate 0 Term 0 IncomeRange 0 BorrowerRate 0 dtype: int64
# List above contains unnecessary entries limiting viewing.
# Add equality test to narrow down to column names with
Nan_sublist = Nan_sublist_all[Nan_sublist_all.values > 0]
Nan_sublist.shape
(11,)
# explicit figure and figsize declaraction
fig, axes = plt.subplots(1, 1)
sb.despine(fig)
# plot missing values in descending order
g0 = sb.barplot(x=Nan_sublist.values, y=Nan_sublist.index, ax=axes,
color='red', order=Nan_sublist.sort_values(ascending=False).index)
yplot_values(g0, '{:1.0f}') # plot y values onto graph
fig.suptitle('Number of NA values in each Variable', fontsize=18);
loans_df.shape # review size prior to dropping Nan
(106196, 19)
Nan_sublist.sort_values(ascending=False)[4:].index
Index(['ProsperRating (numeric)', 'DebtToIncomeRatio', 'BorrowerState',
'Occupation', 'CreditRating', 'ProsperRatingNum',
'CurrentDelinquencies'],
dtype='object')
# slice object, convert to list and check datatype
subset_drop = Nan_sublist.sort_values(ascending=False)[4:].index.tolist()
subset_drop, type(subset_drop)
(['ProsperRating (numeric)', 'DebtToIncomeRatio', 'BorrowerState', 'Occupation', 'CreditRating', 'ProsperRatingNum', 'CurrentDelinquencies'], list)
# drop Nan values
loans_df.dropna(axis=0, subset=subset_drop, inplace=True)
loans_df.shape
(76224, 19)
The EDA will revolve around the exploration of the 14 variables mentioned previously.
loans_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 76224 entries, 1 to 113936 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 EmploymentStatus 76224 non-null object 1 BorrowerState 76224 non-null object 2 Occupation 76224 non-null object 3 CreditGrade 0 non-null category 4 ProsperRating (Alpha) 76224 non-null object 5 ProsperRating (numeric) 76224 non-null float64 6 IncomeRange 76224 non-null category 7 Term 76224 non-null int64 8 CurrentDelinquencies 76224 non-null float64 9 BorrowerRate 76224 non-null float64 10 DebtToIncomeRatio 76224 non-null float64 11 ListingCreationDate 76224 non-null datetime64[ns] 12 ClosedDate 23020 non-null datetime64[ns] 13 MonthlyLoanPayment 76224 non-null float64 14 ListingCreationYear 76224 non-null int64 15 ListingCreationMonth 76224 non-null int64 16 ClosedDateYear 23020 non-null float64 17 CreditRating 76224 non-null category 18 ProsperRatingNum 76224 non-null float32 dtypes: category(3), datetime64[ns](2), float32(1), float64(6), int64(3), object(4) memory usage: 9.8+ MB
# drop CreditGrade as it creates an additional row/column in the pair plot
loans_df.drop(labels='CreditGrade', axis=1, inplace=True)
# set 'n' samples
loans_samples = loans_df.sample(n=1000)
# pairplot
g0 = sb.pairplot(loans_samples, kind='scatter', dropna=True, height=2.5)
g0.fig.subplots_adjust(top=0.9)
g0.fig.suptitle('Comparison of Numerical Variables', fontsize=28);
Of the 500 random samples:
Of the quantitative variables above, we will investigate Prosper Score/Credit Grade, Borrowrate, MonthlyLoanPayment to clarify the correlations seen above against categorical variables that have not yet been considered. The heatmap below provides numeric evidence of the visual inspection of the pair grid.
# heatmap
g1 = sb.heatmap(loans_df.corr(), annot=True, fmt='.2f',
cmap='vlag_r', center=0)
plt.title('Numerical Correlations', fontsize=20);
What factors greatly affect the BorrowRate and as a result the MonthlyLoanPayment of the customer?
loans_df.BorrowerRate.describe().reset_index()
| index | BorrowerRate | |
|---|---|---|
| 0 | count | 76224.000000 |
| 1 | mean | 0.193624 |
| 2 | std | 0.074089 |
| 3 | min | 0.040000 |
| 4 | 25% | 0.134900 |
| 5 | 50% | 0.184500 |
| 6 | 75% | 0.254900 |
| 7 | max | 0.360000 |
# initialize
fig, axes = plt.subplots(1, 1, figsize=(10, 4))
sb.despine(fig)
# set ticks
minTickBorrow = np.arange(0, loans_df.BorrowerRate.max()+0.04, 0.02)
# graph 1
g0 = sb.violinplot(x=loans_df.BorrowerRate, color='Red', ax=axes) # graph 1
plt.setp(g0.collections, alpha=.6)
# graph 2
g0 = sb.boxplot(x=loans_df.BorrowerRate,
boxprops=dict(alpha=.8), ax=axes) # graph 2
g0.set(xticks=minTickBorrow, xlabel='Borrow Rate %')
fig.suptitle('Borrower Rate Data Spread', fontsize=20);
Both graphs approximate the median BorrowRate to be 18%, both plots suggest outliers to be considered ~38% and above. This allows unless to optimize the bins to be between 0 and 0.4.
# initialize
fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
sb.despine(fig)
# graph 2 - distribution of borrow rates
g1 = sb.histplot(loans_df, x='BorrowerRate', element='bars', multiple="stack",
ax=axes[0])
# graph 3
axes[1].set(yscale="log") # scale y axis only
g2 = sb.histplot(loans_df, x='BorrowerRate', element='bars', multiple="stack",
# log_scale=True, # log scales both x & y
ax=axes[1])
g2.set(xticks=minTickBorrow, xlabel='Borrow Rate %')
fig.suptitle('Borrower Rate Frequency', fontsize=16); # set title
BorrowRate variable shows a multi-modal normal distribution, one towards the lower end where the bulk of the loaners are within 10% to 20%. The second towards the higher end with a peak at 31%.
The log scale transformation reveals the outliers not seen in the prior 2 graphs above it.
With the distribution of primary variable explored. The remaining categorical variables that were not able to be analysed in the pair plot will be investigated to determine possibly influences with BorrowerRate.
loans_df.EmploymentStatus.value_counts()
Employed 65884 Full-time 7584 Other 2194 Retired 320 Part-time 199 Self-employed 42 Not employed 1 Name: EmploymentStatus, dtype: int64
# initialize
fig, axes = plt.subplots(3, 1, figsize=(16, 12))
sb.despine(fig)
# graph 1 - employment status
g0 = sb.barplot(x=loans_df.EmploymentStatus.value_counts().index,
y=loans_df.EmploymentStatus.value_counts().values, ax=axes[0])
xplot_values(g0, '{:1.0f}') # call func. add values to the chart
# graph 2 - distribution of borrow rates
g1 = sb.histplot(loans_df, x='BorrowerRate', binrange=(
0, 0.4), element='bars', hue='EmploymentStatus', multiple='stack', ax=axes[1])
g1.set(xticks=minTickBorrow, xlabel='Borrow Rate %',
yticks=np.arange(0, 5000, 500))
# graph 3 - log scale transformation of of graph 1 to assist in visualizing smaller values
axes[2].set(yscale="log")
g2 = sb.histplot(loans_df, x='BorrowerRate', binrange=(0, 0.4), element='bars',
hue='EmploymentStatus', multiple='stack', ax=axes[2], legend=False)
g2.set(xticks=minTickBorrow, xlabel='Borrow Rate %')
plt.suptitle('BorrowerRate vs Employment', fontsize=20);
The univariate exploration of EmploymentStatus reveals majority of the loan applicants are 'Employed', which is redundant as it splits further into Full-time Employed, Self-Employed, Part-time, other etc meaning the data classifications were not considered appropriately. The Bivariate exploration against BorrowRate, reveals the spread of employment with generally revolving around 10 to 30% +/- 5%, with a large peak at ~32%. The log transformation provides clarity for the lower values not otherwise seen in graph 2, the borrow rate of full-time employees receiving ~0.4% is interesting. If they were incorrectly entered or due to the low overall amount required to be borrowed that led to such a low rate.
# initialize
sb.set_style("whitegrid", {'grid.linestyle': '--'})
fig, axes = plt.subplots(1, 1, figsize=(12, 7))
sb.despine(fig)
# prepare ticks
minTickBorrow = np.arange(0, loans_df.BorrowerRate.max()+0.1, 0.02)
# graph 4
g3 = sb.violinplot(x=loans_df.BorrowerRate,
y=loans_df.EmploymentStatus, ax=axes)
plt.setp(g3.collections, alpha=.8)
# graph 5
g4 = sb.boxplot(x=loans_df.BorrowerRate, y=loans_df.EmploymentStatus, ax=axes)
g4.set(xticks=minTickBorrow, xlabel='Borrow Rate %')
# filter out employed and other
employment_mask = ~loans_df['EmploymentStatus'].isin(['Employed', 'Other'])
fig.suptitle('Employment Status', fontsize=20);
It is evident that BorrowerRate is affected by EmploymentStatus given the following: -Of the 7 employment types, the median borrow rate of Not employed is significantly higher then the remaining, as well a large majority of it's spread situated at the higher borrow rates due to the increased risk of the individual paying it back. -Self-employed & Full-time employees having the lowest medians -Employed & Other are generalised as discussed previously.
It is interesting to see Other has the widest spread at ~32%, with Employed having the second highest width. The remaining employment types seem to taper off at the same amount.
loans_df.Occupation.value_counts()
Other 18501
Professional 9917
Executive 3206
Computer Programmer 3038
Teacher 2777
...
Student - College Junior 19
Student - College Freshman 12
Student - College Sophomore 9
Student - Community College 5
Student - Technical School 1
Name: Occupation, Length: 67, dtype: int64
# subplots(row, col) , figsize(x,y)
fig, axes = plt.subplots(1, 1, figsize=(7, 12))
sb.despine(fig)
# axes.set(xscale="log")
g0 = sb.countplot(data=loans_df, y='Occupation',
order=loans_df.Occupation.value_counts().index)
# call function and plot values of graph
yplot_values(g0, '{:1.0f}')
fig.suptitle('Occupation Counts', fontsize=20);
# occupation mask
The above graph shows several types of Occupations, split into there appropriate position name. It is important to note the top two occupations are ambigious and uninformative i.e. Professionals and Other.
As there several Occupations to display, it will be quite cumbersome to display all the statistical distributions of each, example seen below.
fig, axes = plt.subplots(1, 1, figsize=(9, 12))
sb.despine(fig)
g1 = sb.boxplot(y=loans_df.Occupation, x=loans_df.BorrowerRate)
g1.set(xticks=np.arange(0, loans_df['BorrowerRate'].max(
)+0.04, 0.02), xlabel='Borrow Rate %') # preset variables
fig.subplots_adjust(top=0.9) # adjust offset
fig.suptitle('BorrowerRate vs Occupation', fontsize=20);
Using the box plot above, we can see a better capture of the Occupations relationship to BorrowerRate. It is particularly evident that Occupations do impact borrow rates as they are a subset of EmploymentStatus evident with Judge and Doctors appearing to have the lowest median borrow rates, where as Student College Freshman and Teachers aid appear to have the highest.
Due to the quantity of the Occupations list, a select few professions will be assessed to ensure a wide proportion of the population is captured. This selection is visual based with attempts to collect a low and high borrow rate of each field i.e. business, law, engineering, public service, admin, etc.
# initialize
fig, axes = plt.subplots(4, 1, figsize=(10, 8), sharex=True)
sb.despine(fig)
# filter by text
g2 = sb.boxplot(y=loans_df[loans_df['Occupation'].str.contains(pat='Trade', regex=False)].Occupation,
x=loans_df.BorrowerRate, ax=axes[0])
g3 = sb.boxplot(y=loans_df[loans_df['Occupation'].str.contains(pat='Student', regex=False)].Occupation,
x=loans_df.BorrowerRate, ax=axes[1])
g4 = sb.boxplot(y=loans_df[loans_df['Occupation'].str.contains(pat='Engineer', regex=False)].Occupation,
x=loans_df.BorrowerRate, ax=axes[2])
g5 = sb.boxplot(y=loans_df[loans_df['Occupation'].str.contains(pat='Retail', regex=False)].Occupation,
x=loans_df.BorrowerRate, ax=axes[3])
g5.set(xticks=minTickBorrow, xlabel='Borrow Rate %')
fig.suptitle('Sample Occupations vs BorrowerRate', fontsize=20);
# sliced list omitting Professional and Other
key_occp_list = loans_df.Occupation.value_counts().index[2:11].to_list()
# custom selection of Occupations to incorporate one from each field of work
custom_occp_list = ['Police Officer/Correction Officer', 'Fireman',
'Attorney', 'Engineer - Mechanical',
'Doctor', 'Judge', 'Investor', 'Sales - Retail',
'Student - College Freshman', 'Student - College Graduate Student']
sample_occp_list = key_occp_list + custom_occp_list # append both lists into one
sample_occp_mask = loans_df.Occupation.isin(sample_occp_list)
# initialize
fig, axes = plt.subplots(1, 1, figsize=(12, 10))
sb.despine(fig)
g6 = sb.violinplot(
y=loans_df[sample_occp_mask].Occupation,
x=loans_df.BorrowerRate)
plt.setp(g6.collections, alpha=.7)
g7 = sb.boxplot(
y=loans_df[sample_occp_mask].Occupation,
x=loans_df.BorrowerRate,
boxprops=dict(alpha=.8))
g7.set(xticks=np.arange(0, loans_df.BorrowerRate.max() +
0.1, 0.02), xlabel='Borrow Rate %')
fig.suptitle('Sample list of Occupations vs BorrowerRate', fontsize=20);
Initially the two plots were on separate subplots which made comparing them side by side difficult. As such combining them onto the same plot allows for a better visual of the width of the violins, as well as the outliers, lower and upper quartiles shown by the box plot. It is apparent Occupations do impact the borrow rates issued out by Prosper, next variable to investigate is Income Range which directly ties to the amount each occupation is paid.
Prior to dropping Other and Professional occupations, an investigation into the spread of data will be performed to determine how it sits when compared against Borrower Rate.
loans_df.shape # taking note of current dataframe size
(76224, 18)
loans_df[sample_occp_mask].shape # the size based on the above filter
(28564, 18)
# the size based on Other and Professional
# get Occupations with Other and Professional
sample_misc_loans = loans_df[loans_df.Occupation.isin(
['Other', 'Professional'])]
sample_misc_loans.shape
(28418, 18)
# the size of the remaining excluding other and Professional
sample_remain_loans = loans_df[(loans_df.Occupation != 'Other') & (
loans_df.Occupation != 'Professional')]
sample_remain_loans.shape
(47806, 18)
# plot to show the distribution of Other and Professional occupations
# subplots(row, col)
fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
sb.despine(fig)
# graph 1 - distribution of borrow rates
g8 = sb.histplot(sample_misc_loans, # new sample data frame above
x='BorrowerRate', binrange=(0, 0.4), element='bars',
hue='Occupation', multiple='stack', ax=axes[0])
# graph 2 - log scale transformation of of graph 1 to assist in visualizing smaller values
axes[1].set(yscale="log")
g9 = sb.histplot(sample_misc_loans,
x='BorrowerRate', binrange=(0, 0.4), element='bars',
hue='Occupation', multiple='stack', ax=axes[1], legend=False)
g9.set(xticks=np.arange(
0, loans_df['BorrowerRate'].max()+0.04, 0.02), xlabel='Borrow Rate %')
# TODO - consolidate legend
fig.suptitle('BorrowRate for Other and Professional Occupations', fontsize=20);
Before dropping Other and Professional occupations from the dataframe, we observe the plot above showing a normal distribution and its potential influence to BorrowRate. The peak at ~32% would appear to the usual BorrowRate provided by Prosper.
fig, axes = plt.subplots(1, 1, figsize=(10, 3))
sb.despine(fig)
g0 = sb.boxplot(data=sample_misc_loans,
x='BorrowerRate', y='EmploymentStatus', hue='Occupation')
g0.set(xticks=np.arange(0, loans_df.BorrowerRate.max() +
0.04, 0.02), xlabel='Borrow Rate %')
g0.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
fig.suptitle('BorrowRate for "Other and Professional Occupations"');
The next category of interest is the Incomes of the respective Occupations above as well its overall spread.
loans_df.IncomeRange.value_counts(dropna=False)
$50,000-74,999 23696 $25,000-49,999 21422 $100,000+ 13980 $75,000-99,999 13547 $1-24,999 3578 $0 1 Name: IncomeRange, dtype: int64
loans_df.ProsperRatingNum.dtypes
dtype('float32')
# initialize
fig, axes = plt.subplots(2, 1, figsize=(12, 12))
sb.despine(fig)
# graph 1 - distribution of Income range
g0 = sb.histplot(data=loans_df, # new sample data frame above
x='IncomeRange', element='bars', multiple='stack', ax=axes[0])
g0.set(yticks=np.arange(0, 30000, 2000))
xplot_values(g0, '{:1.0f}')
# graph 2 - distribution of Income range
g1 = sb.violinplot(data=loans_df,
x='BorrowerRate', y='IncomeRange', ax=axes[1])
plt.setp(g1.collections, alpha=0.7)
# graph 3
g1 = sb.boxplot(data=loans_df,
x='BorrowerRate', y='IncomeRange', boxprops=dict(alpha=0.8), ax=axes[1])
g1.set(xticks=minTickBorrow, xlabel='Borrow Rate %')
plt.suptitle('Income Range', fontsize=20);
From the plots above we can see a left skewed distribution, with the majority of incomes ranging almost equally between \$25,000-49,999 and \$50,000-75,000 .
The plot of BorrowerRate against IncomeRange reveals a pattern reinforcing the idea that Income affects the BorrowerRate provided by Prosper, explicitly that the higher your income guarantees you will have a lower borrow rate on your loan.
Occ_order = ['Computer Programmer', 'Analyst',
'Executive', 'Investor', 'Sales - Commission',
'Accountant/CPA', 'Clerical', 'Administrative Assistant',
'Teacher', 'Student - College Freshman', 'Student - College Graduate Student',
'Judge', 'Attorney', 'Doctor', 'Engineer - Mechanical',
'Fireman', 'Police Officer/Correction Officer', 'Skilled Labor']
# re-use previous list of Occupations to produce a multi-variate plot
g0 = sb.FacetGrid(data=loans_df[sample_occp_mask],
# number of columns before starting a new row
col='Occupation', col_wrap=3, col_order=Occ_order,
sharex=False, height=3, aspect=2.5, xlim=(0, 0.4),
margin_titles=True)
g0.map_dataframe(sb.boxplot, x='BorrowerRate', y='IncomeRange')
g0.set(xticks=np.arange(0, loans_df.BorrowerRate.max() +
0.04, 0.02), xlabel='Borrow Rate %')
g0.fig.subplots_adjust(top=0.9) # adjust offset
plt.suptitle('Occupations - BorrowerRate vs IncomeRange', fontsize=20);
From the multi-variates, it is obvious that the higher income ranges, generally results in lower BorrowerRates. The only exception to these are Doctors, Clerical, Investors and Student - College Graduate Students. It is interesting to see how there are some loans approved despite being listed for having No income otherwise considered as \$0 as shown above.
loans_df.Term.value_counts()
36 51926 60 22884 12 1414 Name: Term, dtype: int64
# initialize
fig, axes = plt.subplots(2, 1, figsize=(12, 7), sharex=True)
sb.despine(fig)
# set ticks
mintickMLP = np.arange(0, loans_df.MonthlyLoanPayment.max()+200, 200)
# graph 1 - distribution of borrow rates
g0 = sb.histplot(loans_df, x='MonthlyLoanPayment', element='bars', multiple="stack",
ax=axes[0])
# graph 2
axes[1].set(yscale="log")
g1 = sb.histplot(loans_df,
x='MonthlyLoanPayment', element='bars', multiple="stack",
ax=axes[1])
g1.set(xticks=mintickMLP, xlabel='Monthly Loan Payment ($)')
plt.suptitle('MonthlyLoanPayment', fontsize=20);
From the above histogram plots we are able to see a heavily right tailed distribution, where the logarithmic scaling reveals the outliers ~\$1500 per month and above, the bulk of the customer repayments ranging around \$150 per month.
# set figure to have major grid lines
# provide gridlines at major ticks
sb.set_style("whitegrid", {'grid.linestyle': '--'})
fig, axes = plt.subplots(1, 2, figsize=(14, 10))
sb.despine(fig)
g0 = sb.boxplot(x=loans_df['MonthlyLoanPayment'], y=loans_df[sample_occp_mask].Occupation, # sample list of occupations
hue=loans_df['Term'], ax=axes[0])
g0.set(xticks=mintickMLP, xlabel='Monthly Loan Payment ($)')
g1 = sb.violinplot(y=loans_df['MonthlyLoanPayment'],
x=loans_df.Term, ax=axes[1])
g1 = sb.boxplot(y=loans_df['MonthlyLoanPayment'],
x=loans_df.Term, ax=axes[1])
g1.set(yticks=mintickMLP, ylabel='Monthly Loan Payment ($)')
fig.suptitle('Occupations - Monthly Repayment vs Term', fontsize=20);
The table summary, violin & box plot above reveals the majority of Prosper clients requiring loans prefer 36 month/3 year terms.
The plot reveals a significant quantity of outliers within 12 month/1 year terms, afterwards the outliers gradually decrease as the terms increase. The median monthly repayments range between as low as $100 to as high as $400 across all Occupations.
Delving deeper into the spread within each of the sample occupations mentioned earlier, we can see that Doctors have a large IQR (interquartile range) indicating a large population of doctors have loan repayments of substantial variance when taking out a 12 month/1 year loan, potentially indicating the professions potential buying power, spending habits/capital investments required as part of there profession, i.e. dental equipment amongst other various factors speculating if these Doctors start their own practice and need the large capital investment.
The graph below is the previously plotted graph as seen on the pair plot at the start of section 2. EDA, however this graph includes the term variable to reveal the distribution of repayments amounts based on the terms chosen.
The plot below further supports the statement above where the people prefer the 36 month/3 year term, it reveals the spread of individuals with there respective loan repayments vs the borrower rate they have been given by Prosper. The 36 month/3 year term is significantly dominant soon followed by the 60 month/5 year then 12 month/1 year.
loans_df.head(5)
| EmploymentStatus | BorrowerState | Occupation | ProsperRating (Alpha) | ProsperRating (numeric) | IncomeRange | Term | CurrentDelinquencies | BorrowerRate | DebtToIncomeRatio | ListingCreationDate | ClosedDate | MonthlyLoanPayment | ListingCreationYear | ListingCreationMonth | ClosedDateYear | CreditRating | ProsperRatingNum | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Employed | CO | Professional | A | 6.0 | $50,000-74,999 | 36 | 0.0 | 0.0920 | 0.18 | 2014-02-27 08:28:07.900 | NaT | 318.93 | 2014 | 2 | NaN | A | 6.0 |
| 3 | Employed | GA | Skilled Labor | A | 6.0 | $25,000-49,999 | 36 | 4.0 | 0.0974 | 0.15 | 2012-10-22 11:02:35.010 | NaT | 321.45 | 2012 | 10 | NaN | A | 6.0 |
| 4 | Employed | MN | Executive | D | 3.0 | $100,000+ | 36 | 0.0 | 0.2085 | 0.26 | 2013-09-14 18:38:39.097 | NaT | 563.97 | 2013 | 9 | NaN | D | 3.0 |
| 5 | Employed | NM | Professional | B | 5.0 | $100,000+ | 60 | 0.0 | 0.1314 | 0.36 | 2013-12-14 08:26:37.093 | NaT | 342.37 | 2013 | 12 | NaN | B | 5.0 |
| 6 | Employed | KS | Sales - Retail | E | 2.0 | $25,000-49,999 | 36 | 0.0 | 0.2712 | 0.27 | 2013-04-12 09:52:56.147 | NaT | 122.67 | 2013 | 4 | NaN | E | 2.0 |
g0 = sb.jointplot(data=loans_df, x='BorrowerRate', y='MonthlyLoanPayment', hue='Term',
palette=['red', 'green', 'blue'], kind='kde', alpha=0.75, height=10, ratio=4, xlim=(0, 0.38), ylim=(-50, 2400))
g0.set_axis_labels(xlabel='Borrow Rate %', ylabel='MonthlyLoanPayment ($)')
# jointplot tick increments
g0.ax_joint.xaxis.set_major_locator(ticker.MultipleLocator(0.02))
g0.ax_joint.yaxis.set_major_locator(ticker.MultipleLocator(200))
g0.fig.subplots_adjust(top=0.95) # adjust offset
g0.fig.suptitle('Repayments - BorrowerRate vs Term', fontsize=20);
C:\Users\jcala\anaconda3\lib\site-packages\seaborn\distributions.py:1182: UserWarning: No contour levels were found within the data range. cset = contour_func(
CreditRating is a combination of ProsperRating & CreditGrade
loans_df['CreditRating'].value_counts(dropna=False)
C 16501 B 14379 A 13491 D 12631 E 8443 HR 5682 AA 5097 Name: CreditRating, dtype: int64
# define figure and num. axes
sb.set_style("whitegrid", {'grid.linestyle': '--'})
sb.despine(fig)
g0 = sb.jointplot(data=loans_df, x='BorrowerRate', y='MonthlyLoanPayment', hue=loans_df['CreditRating'], hue_order=ordinal_dict['CreditGrade'],
kind='scatter', height=10, ratio=5, space=0.3, xlim=(0, 0.38), ylim=(-50, 2400))
g0.set_axis_labels(xlabel='Borrow Rate %', ylabel='MonthlyLoanPayment ($)')
# change x/y axis values
g0.ax_joint.xaxis.set_major_locator(ticker.MultipleLocator(0.02))
g0.ax_joint.yaxis.set_major_locator(ticker.MultipleLocator(200))
g0.fig.subplots_adjust(top=0.9) # adjust offset
g0.fig.suptitle('CreditRating - BorrowerRate vs Term', fontsize=20);
The above joint plot supports the ordinal data hierarchy evident with an AA rating receiving the lowest BorrowerRates and HR i.e. high risk ratings receiving the highest. As highlighted during the plotting of the numerical variables, there is a high correlation between ProsperRating (numeric) i.e. a numerical rating of CreditGrade and ProsperRating (Alpha).
The bivariate kde plot of MonthlyLoanPayments is difficult to view and is plotted below, showing similar distributions for A, B & C ratings. All the graphs apart from the HR graph are shown to have right tailed distributions supporting the previous findings of clients having lower monthly repayments. The only rating that is not aligned with the others is the HR rating.
sb.set_style("whitegrid", {'grid.linestyle': '--'})
g0 = sb.FacetGrid(data=loans_df[sample_occp_mask], col='CreditRating', col_wrap=3,
col_order=ordinal_dict['CreditGrade'], sharex=False, xlim=(0, 1300),
height=5, aspect=1.5)
g0.map(sb.histplot, 'MonthlyLoanPayment', kde=True)
g0.set(xticks=np.arange(0, 1400, 100), xlabel='Monthly Loan Payment ($)', # values trimmed from 2000 to 1000
yticks=np.arange(0, 1100, 50)) # prior to filter 2000 was the recorded max
g0.fig.subplots_adjust(top=0.9) # adjust offset
plt.suptitle('Monthly Loan Payment Distribution of each Rating', fontsize=20);
fig, axes = plt.subplots(1, 1, figsize=(10, 10))
sb.despine(fig)
# Using the graph from the
g0 = sb.violinplot(x=loans_df['ProsperRating (numeric)'],
y=loans_df[sample_occp_mask].Occupation)
g0 = sb.boxplot(x=loans_df['ProsperRating (numeric)'],
y=loans_df[sample_occp_mask].Occupation)
plt.setp(g0.collections, alpha=0.7) # seaborn change transparency
g0.set(xticks=np.arange(0, 10, 1))
fig.subplots_adjust(top=0.9) # adjust offset
fig.suptitle('', fontsize=20);
Occ_order = ['Computer Programmer', 'Analyst',
'Executive', 'Investor', 'Sales - Commission',
'Accountant/CPA', 'Clerical', 'Administrative Assistant',
'Teacher', 'Student - College Freshman', 'Student - College Graduate Student',
'Judge', 'Attorney', 'Doctor', 'Engineer - Mechanical',
'Fireman', 'Police Officer/Correction Officer', 'Skilled Labor']
g0 = sb.FacetGrid(data=loans_df[sample_occp_mask], col='Occupation',
col_wrap=4, col_order=Occ_order, # height=2,
aspect=1.5, sharex=False, margin_titles=True)
g0.map_dataframe(sb.boxplot, 'ProsperRatingNum', 'EmploymentStatus');
In the above occupations we can see the Prosper Ratings ranging between 3 and 5, which corresponds to a B, C and D rating. Doctors have the same IQR across the various employment status, Employed Judges has best box plot across across all occupations and employment types.
loans_df.BorrowerState.value_counts().reset_index()
| index | BorrowerState | |
|---|---|---|
| 0 | CA | 9575 |
| 1 | NY | 5184 |
| 2 | TX | 5086 |
| 3 | FL | 4739 |
| 4 | IL | 3871 |
| 5 | OH | 3061 |
| 6 | GA | 3006 |
| 7 | VA | 2530 |
| 8 | NJ | 2466 |
| 9 | PA | 2417 |
| 10 | MI | 2349 |
| 11 | NC | 2194 |
| 12 | MD | 2066 |
| 13 | WA | 1925 |
| 14 | MA | 1696 |
| 15 | MO | 1591 |
| 16 | MN | 1554 |
| 17 | CO | 1539 |
| 18 | IN | 1518 |
| 19 | TN | 1396 |
| 20 | CT | 1371 |
| 21 | WI | 1359 |
| 22 | AZ | 1237 |
| 23 | AL | 1085 |
| 24 | OR | 1045 |
| 25 | NV | 916 |
| 26 | SC | 872 |
| 27 | KY | 804 |
| 28 | KS | 775 |
| 29 | LA | 768 |
| 30 | AR | 689 |
| 31 | OK | 665 |
| 32 | MS | 601 |
| 33 | NE | 496 |
| 34 | UT | 441 |
| 35 | NH | 410 |
| 36 | RI | 376 |
| 37 | ID | 348 |
| 38 | DC | 310 |
| 39 | HI | 307 |
| 40 | NM | 293 |
| 41 | WV | 274 |
| 42 | DE | 253 |
| 43 | MT | 192 |
| 44 | SD | 165 |
| 45 | VT | 151 |
| 46 | AK | 150 |
| 47 | WY | 108 |
As borrower states are numerous only the top 6 states will be analysed.
# top X states, 9 picked as default as it provides a 3x3 grid
states = 9 # user to change
StateTopN = loans_df.BorrowerState.value_counts().head(states)
StateTopN.reset_index()
| index | BorrowerState | |
|---|---|---|
| 0 | CA | 9575 |
| 1 | NY | 5184 |
| 2 | TX | 5086 |
| 3 | FL | 4739 |
| 4 | IL | 3871 |
| 5 | OH | 3061 |
| 6 | GA | 3006 |
| 7 | VA | 2530 |
| 8 | NJ | 2466 |
state_N_mask = loans_df['BorrowerState'].isin(StateTopN.index)
loans_df[state_N_mask].sample(5)
| EmploymentStatus | BorrowerState | Occupation | ProsperRating (Alpha) | ProsperRating (numeric) | IncomeRange | Term | CurrentDelinquencies | BorrowerRate | DebtToIncomeRatio | ListingCreationDate | ClosedDate | MonthlyLoanPayment | ListingCreationYear | ListingCreationMonth | ClosedDateYear | CreditRating | ProsperRatingNum | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18795 | Full-time | NY | Laborer | D | 3.0 | $1-24,999 | 36 | 0.0 | 0.2199 | 1.01 | 2011-05-04 11:29:06.270 | 2014-02-20 | 229.11 | 2011 | 5 | 2014.0 | D | 3.0 |
| 29182 | Employed | NY | Nurse (LPN) | HR | 1.0 | $25,000-49,999 | 36 | 0.0 | 0.3134 | 0.49 | 2013-09-30 08:47:30.377 | NaT | 172.76 | 2013 | 9 | NaN | HR | 1.0 |
| 63690 | Other | NY | Other | HR | 1.0 | $25,000-49,999 | 36 | 0.0 | 0.3177 | 0.06 | 2012-12-08 22:01:21.020 | NaT | 173.71 | 2012 | 12 | NaN | HR | 1.0 |
| 101013 | Employed | GA | Medical Technician | C | 4.0 | $25,000-49,999 | 60 | 0.0 | 0.2024 | 0.36 | 2013-04-08 18:41:02.530 | NaT | 266.28 | 2013 | 4 | NaN | C | 4.0 |
| 39604 | Employed | TX | Professional | E | 2.0 | $100,000+ | 36 | 0.0 | 0.2859 | 0.10 | 2012-08-29 10:46:59.473 | NaT | 83.37 | 2012 | 8 | NaN | E | 2.0 |
fig, axes = plt.subplots(1, 1, figsize=(5, 3))
sb.despine(fig)
g0 = sb.histplot(loans_df[state_N_mask],
x='BorrowerState', ax=axes)
xplot_values(g0, '{:1.0f}')
g0 = sb.FacetGrid(data=loans_df[state_N_mask], col='BorrowerState',
col_wrap=3, # number of columns before starting a new row
height=5,
aspect=1.5, sharex=False, margin_titles=True)
g0.map_dataframe(sb.histplot, 'BorrowerRate', kde=True)
g0.set(xticks=np.arange(
0, loans_df['BorrowerRate'].max()+.04, 0.02), xlabel='Borrow Rate %')
g0.fig.subplots_adjust(top=0.9) # adjust offset
plt.suptitle('BorrowerState vs State', fontsize=20);
sb.set_style("whitegrid", {'grid.linestyle': '--'})
g1 = sb.FacetGrid(data=loans_df[state_N_mask], col='BorrowerState',
col_wrap=3, # number of columns before starting a new row
height=3,
aspect=2, sharex=False, margin_titles=True, xlim=(0, 1600))
g1.map_dataframe(sb.histplot, 'MonthlyLoanPayment', kde=True)
g1.set(xticks=np.arange(0, loans_df.MonthlyLoanPayment.max()+200, 200), xlabel='Monthly Loan Payment ($)')
# figure title format
g1.fig.subplots_adjust(top=0.9)
g1.fig.suptitle('Top 9 States Monthly Loan Repayment', fontsize=20);
Of the top 9 states requiring loans, CA appears to have requested the most loans with VA being the lowest of the 9 states.
Analysis of numerical variable against categoricals
loans_df.DebtToIncomeRatio.value_counts()
0.18 3177
0.22 2974
0.17 2716
0.14 2675
0.21 2543
...
5.55 1
5.64 1
2.19 1
3.29 1
3.66 1
Name: DebtToIncomeRatio, Length: 258, dtype: int64
categories
['EmploymentStatus', 'BorrowerState', 'Occupation', 'CreditRating', 'CreditGrade', 'IncomeRange']
sb.set_style("whitegrid", {'grid.linestyle': '--'})
# subplots(row, col)
fig, axes = plt.subplots(2, 1, figsize=(8, 4))
sb.despine(fig)
ticksDIR = np.arange(0, loans_df.DebtToIncomeRatio.max()+1, 1)
# graph 1
g0 = sb.histplot(data=loans_df, x='DebtToIncomeRatio', ax=axes[0])
g0.set(xticks=ticksDIR)
# graph 2
axes[1].set(yscale="log")
g1 = sb.histplot(data=loans_df, x='DebtToIncomeRatio', ax=axes[1])
g1.set(xticks=ticksDIR);
From the logarithmic transformation above, we can see several plots past the ratio of 1, indicating there debts are greater then there available income. It would interesting to see what the ProsperRatingNum as well as the Borrow Rate are for these individuals including there Income Range.
# new dataframe
# filter dataframe by DIR >1
loans_DIR_filter = loans_df.copy()[loans_df['DebtToIncomeRatio'] > 1]
loans_DIR_filter.sample(5)
| EmploymentStatus | BorrowerState | Occupation | ProsperRating (Alpha) | ProsperRating (numeric) | IncomeRange | Term | CurrentDelinquencies | BorrowerRate | DebtToIncomeRatio | ListingCreationDate | ClosedDate | MonthlyLoanPayment | ListingCreationYear | ListingCreationMonth | ClosedDateYear | CreditRating | ProsperRatingNum | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30969 | Employed | KS | Student - College Senior | D | 3.0 | $1-24,999 | 36 | 0.0 | 0.2346 | 1.10 | 2012-09-06 15:03:34.407 | NaT | 89.58 | 2012 | 9 | NaN | D | 3.0 |
| 95637 | Employed | IN | Skilled Labor | D | 3.0 | $1-24,999 | 60 | 0.0 | 0.2870 | 5.76 | 2012-02-29 12:40:52.963 | NaT | 315.59 | 2012 | 2 | NaN | D | 3.0 |
| 40510 | Full-time | FL | Teacher | C | 4.0 | $25,000-49,999 | 36 | 0.0 | 0.2980 | 5.56 | 2010-04-05 19:39:44.687 | 2012-11-09 | 127.03 | 2010 | 4 | 2012.0 | C | 4.0 |
| 98716 | Employed | FL | Nurse (RN) | HR | 1.0 | $1-24,999 | 36 | 0.0 | 0.3177 | 2.74 | 2012-05-24 09:00:19.187 | NaT | 173.71 | 2012 | 5 | NaN | HR | 1.0 |
| 74752 | Other | FL | Other | A | 6.0 | $1-24,999 | 60 | 0.0 | 0.1728 | 1.16 | 2012-07-21 13:32:07.023 | NaT | 125.02 | 2012 | 7 | NaN | A | 6.0 |
loans_DIR_filter.shape
(322, 18)
g2 = sb.jointplot(data=loans_DIR_filter, x='BorrowerRate', y='DebtToIncomeRatio',
hue='CreditRating', xlim=(0, 0.36), ylim=(-0.5, 10.4))
g2.set_axis_labels(xlabel='Borrow Rate %');
From the above jointplot, for the clients with a debt to income ratio greater then 1, we can see there are a mix of all types of rating Credit ratings. We can confirm that the variable has no effect on Borrower Rate as a result.
loans_df.shape
(76224, 18)
loans_df.ClosedDate.value_counts(dropna=False)
# There are a significant amount of NaT, i.e. loans not yet finished.
NaT 53204
2014-03-04 97
2014-02-11 84
2014-02-19 83
2014-01-14 71
...
2011-02-12 1
2011-05-28 1
2011-10-02 1
2011-08-20 1
2010-08-06 1
Name: ClosedDate, Length: 1424, dtype: int64
# graph 1
g0 = sb.FacetGrid(data=loans_df, col='ListingCreationYear',
col_wrap=3, # number of columns before starting a new row
height=3,
xlim=(0, 0.38),
aspect=2.5, sharex=False, margin_titles=True)
g0.map_dataframe(sb.histplot, x='BorrowerRate')
g0.set(xticks=np.arange(0, 0.38, 0.02), xlabel='Borrow Rate %')
g0.fig.subplots_adjust(top=0.9)
g0.fig.suptitle('Listing Creation Year', fontsize=20)
# graph 2
g1 = sb.FacetGrid(data=loans_df, col='ClosedDateYear',
col_wrap=3, # number of columns before starting a new row
height=3,
aspect=2.5, sharex=False, margin_titles=True)
g1.map_dataframe(sb.histplot, x='BorrowerRate')
g1.set(xticks=np.arange(0, 0.38, 0.02), xlabel='Borrow Rate %',
ylabel='MonthlyLoanPayment ($)')
g1.fig.subplots_adjust(top=0.9)
g1.fig.suptitle('Closed Date Year', fontsize=20);
From the above histogram plots faceted based on years, we can see that 2013 was the year with the most loans closed which makes sense for the following reasons: